A three-dimensional approach to Visual Speech Recognition using Discrete Cosine Transforms
Abstract
Visual speech recognition aims to identify the sequence of phonemes from continuous speech. The traditional approach uses 2D image feature extraction methods to derive features of each video frame separately. This paper instead proposes a 3D (spatio-temporal) Discrete Cosine Transform to extract features of each feasible sub-sequence of an input video. The sub-sequences are then classified individually using Support Vector Machines, and the results are combined to find the most likely phoneme sequence using a tailor-made Hidden Markov Model. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended by training on phoneme or viseme pairs (biphones) to counteract the ambiguity that co-articulation introduces into human speech. The test set accuracy for the recognition of phoneme sequences is 20%, and the accuracy for viseme sequences is 39%. Both results improve on the best values reported in other papers by approximately 2%. The contribution is three-fold. Firstly, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Secondly, the result confirms that 3D feature extraction methods improve the accuracy compared to 2D feature extraction methods. Thirdly, the paper is the first to specifically compare an otherwise identical method with and without biphones, verifying that the use of biphones has a positive impact on the result.
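The feature extraction step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which coefficients are kept, how long a feasible sub-sequence may be, or how the mouth region is cropped, so the `keep` cube, the `min_len`/`max_len` bounds, the clip size, and all function names below are assumptions made for illustration only. The sketch uses NumPy alone and builds an orthonormal DCT-II matrix by hand; it is not the authors' implementation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row gets its own scaling
    return m

def dct3(block):
    """Separable 3D DCT-II of a (frames, height, width) block."""
    t, h, w = block.shape
    out = np.einsum("at,thw->ahw", dct_matrix(t), block)  # temporal axis
    out = np.einsum("bh,ahw->abw", dct_matrix(h), out)    # vertical axis
    out = np.einsum("cw,abw->abc", dct_matrix(w), out)    # horizontal axis
    return out

def features(block, keep=(3, 4, 4)):
    """Flatten the low-frequency corner of the 3D DCT (assumed selection rule)."""
    kt, kh, kw = keep
    return dct3(block)[:kt, :kh, :kw].ravel()

def candidate_segments(n_frames, min_len, max_len):
    """All (start, end) frame ranges with min_len <= end - start <= max_len."""
    return [(s, s + d)
            for d in range(min_len, max_len + 1)
            for s in range(0, n_frames - d + 1)]

# Hypothetical 12-frame, 16x16 mouth-region clip filled with random values.
rng = np.random.default_rng(0)
video = rng.random((12, 16, 16))

segments = candidate_segments(len(video), min_len=3, max_len=5)
feats = {seg: features(video[seg[0]:seg[1]]) for seg in segments}
```

Because only a fixed low-frequency cube is kept, every candidate sub-sequence yields a feature vector of the same length regardless of its duration, which is what allows a single classifier to score segments of different lengths. Each vector in `feats` would then be passed to a classifier, and the per-segment scores combined by a sequence model, as the abstract describes.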
Similar resources
A system for audio-visual speech recognition
In this work, a system for audio-visual speech recognition is presented. A new hybrid visual feature combination, which is suitable for audio-visual speech recognition, was implemented. The features comprise both the shape and the appearance of the lips; dimensionality reduction is applied using the discrete cosine transform (DCT). A large visual speech database of the German language has been ass...
Mutual information based visual feature selection for lipreading
Image transforms, such as the discrete cosine transform, are widely used to extract visual features from the speaker’s mouth region to be used in automatic speechreading and audio-visual speech recognition. Typically, the spatial frequency components with the highest energy in the transform space are retained for recognition. This paper proposes an alternative technique for selecting such features, by ut...
Employing The Complete Face in AVSR to Recover from Facial Occlusions
Existing Audio-Visual Speech Recognition (AVSR) systems visually focus intensely on a small region of the face, centred on the immediate mouth area. This is poor design for a variety of reasons in real world situations because any occlusion to this small area renders all visual advantage null and void. This is poor by design because it is well known that humans use the complete face to speechread....
Image Redundancy Reduction for Neural Network Classification Using Discrete Cosine Transforms
High information redundancy and strong correlations in face images result in inefficiencies when such images are used directly in recognition tasks. In this paper, Discrete Cosine Transforms (DCTs) are used to reduce image information redundancy because only a subset of the transform coefficients is necessary to preserve the most important facial features, such as hair outline, eyes and mouth....
A Comparison of Visual Features for Audio-Visual Automatic Speech Recognition
The use of visual information from the speaker’s mouth region has been shown to improve the performance of Automatic Speech Recognition (ASR) systems. This is particularly useful in the presence of noise, which even in moderate form severely degrades the speech recognition performance of systems using only audio information. Various sets of features extracted from the speaker’s mouth region have been used to im...
Journal title:
- CoRR
Volume abs/1609.01932
Pages -
Publication date 2016